19. Backpropagation Through Time (part c)

Last step! Adjusting W_x, the weight matrix connecting the input to the state.

If you took on the previous challenge of deriving the math yourself first, sit back, fasten your seat belt and compare our notes to yours! Don't worry if you made mistakes; we all do. Your mistakes will help you learn what to avoid next time.


Gradient calculations needed to adjust W_x

To further understand the BPTT process, we will simplify the unfolded model again, this time focusing on how W_x contributes to the output:

Simplified unfolded model for adjusting W_x

When calculating the partial derivative of the loss function with respect to W_x we need to consider, again, all of the states contributing to the output. As we saw before, in this example that means state \bar{s_3}, which depends on its predecessor \bar{s_2}, which in turn depends on its predecessor \bar{s_1}, the first state.

As we mentioned previously, in BPTT we will take into account each gradient stemming from each state, accumulating all of the contributions.

  • At timestep t=3, the contribution to the gradient stemming from \bar{s_3} is the following:
    (Notice the use of the chain rule here. If you need to, go back to the video to visualize the calculation path.)

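\frac{\partial E_3}{\partial \bar{y_3}} \frac{\partial \bar{y_3}}{\partial \bar{s_3}} \frac{\partial \bar{s_3}}{\partial W_x}

Equation 43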

  • At timestep t=3, the contribution to the gradient stemming from \bar{s_2} is the following:
    (Notice how the equation, derived by the chain rule, considers the contribution of \bar{s_2} to \bar{s_3}. If you need to, go back to the video to visualize the calculation path.)

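\frac{\partial E_3}{\partial \bar{y_3}} \frac{\partial \bar{y_3}}{\partial \bar{s_3}} \frac{\partial \bar{s_3}}{\partial \bar{s_2}} \frac{\partial \bar{s_2}}{\partial W_x}

Equation 44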

  • At timestep t=3, the contribution to the gradient stemming from \bar{s_1} is the following:
    (Notice how the equation, derived by the chain rule, considers the contribution of \bar{s_1} to \bar{s_2} and \bar{s_3}. If you need to, go back to the video to visualize the calculation path.)

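\frac{\partial E_3}{\partial \bar{y_3}} \frac{\partial \bar{y_3}}{\partial \bar{s_3}} \frac{\partial \bar{s_3}}{\partial \bar{s_2}} \frac{\partial \bar{s_2}}{\partial \bar{s_1}} \frac{\partial \bar{s_1}}{\partial W_x}

Equation 45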

After considering the contributions from all three states, \bar{s_3}, \bar{s_2} and \bar{s_1}, we will accumulate them to find the final gradient calculation.

The following equation is the gradient contributing to the adjustment of W_x using Backpropagation Through Time:

_Equation 46_

Equation 46
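To make the accumulation concrete, here is a minimal sketch using a scalar RNN. The model form (tanh activation, linear output, squared-error loss), the variable names and every numeric value are assumptions chosen for illustration; they are not taken from the lesson. The sketch computes the three contributions separately, sums them, and compares the result with a finite-difference estimate.

```python
import math

# Hypothetical scalar RNN (all values below are illustrative assumptions):
#   s_t = tanh(w_x * x_t + w_s * s_{t-1}),  y_3 = w_y * s_3,  E_3 = 0.5 * (d_3 - y_3)**2
w_x, w_s, w_y = 0.5, -0.3, 0.8
xs = [1.0, 0.5, -1.5]   # inputs x_1, x_2, x_3
d3 = 0.2                # target at t = 3
s0 = 0.0                # initial state


def forward(w_x):
    """Return the states [s_1, s_2, s_3] and the loss E_3 for a given w_x."""
    s, states = s0, []
    for x in xs:
        s = math.tanh(w_x * x + w_s * s)
        states.append(s)
    y3 = w_y * s
    return states, 0.5 * (d3 - y3) ** 2


(s1, s2, s3), E3 = forward(w_x)

# Factors shared by the contributions.
dE_dy3 = w_y * s3 - d3          # dE_3/dy_3
dy3_ds3 = w_y                   # dy_3/ds_3
ds3_ds2 = (1 - s3 ** 2) * w_s   # ds_3/ds_2
ds2_ds1 = (1 - s2 ** 2) * w_s   # ds_2/ds_1

# Direct derivatives ds_t/dw_x, holding the previous state fixed.
ds3_dwx = (1 - s3 ** 2) * xs[2]
ds2_dwx = (1 - s2 ** 2) * xs[1]
ds1_dwx = (1 - s1 ** 2) * xs[0]

# One contribution per state, each one chain-rule factor longer than the last.
from_s3 = dE_dy3 * dy3_ds3 * ds3_dwx
from_s2 = dE_dy3 * dy3_ds3 * ds3_ds2 * ds2_dwx
from_s1 = dE_dy3 * dy3_ds3 * ds3_ds2 * ds2_ds1 * ds1_dwx
bptt_grad = from_s3 + from_s2 + from_s1

# Central finite difference as an independent check.
eps = 1e-6
numeric_grad = (forward(w_x + eps)[1] - forward(w_x - eps)[1]) / (2 * eps)
print(bptt_grad, numeric_grad)
```

The two printed numbers should agree to several decimal places. Dropping any one of the three contributions breaks that agreement, which is why all of them have to be accumulated.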

As mentioned before, this example had three time steps to consider, so we accumulated three partial derivative calculations. Generally speaking, we can consider multiple timesteps back. If you look closely at equations 43, 44 and 45, you will notice a pattern again: each time we propagate a step back, there is an additional partial derivative to consider in the chain rule. Mathematically, this can be written as the following general equation for adjusting W_x using BPTT:

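\frac{\partial E_N}{\partial W_x} = \sum_{i=1}^{N} \frac{\partial E_N}{\partial \bar{y_N}} \frac{\partial \bar{y_N}}{\partial \bar{s_i}} \frac{\partial \bar{s_i}}{\partial W_x}

Equation 47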
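The general pattern can be sketched as a loop that walks back from the last state to the first, extending the chain by one factor per step. As before, this assumes a hypothetical scalar RNN with a tanh activation, a linear output and a squared-error loss; the function names `forward` and `grad_w_x` and all values are illustrative only.

```python
import math


def forward(w_x, w_s, w_y, xs, d, s0=0.0):
    """Run the hypothetical scalar RNN; return all states and the final loss E_N."""
    states = [s0]                       # states[i] holds s_i, with s_0 the initial state
    for x in xs:
        states.append(math.tanh(w_x * x + w_s * states[-1]))
    y = w_y * states[-1]
    return states, 0.5 * (d - y) ** 2


def grad_w_x(w_x, w_s, w_y, xs, d, s0=0.0):
    """Accumulate dE_N/dw_x over the contributions from s_N back to s_1."""
    states, _ = forward(w_x, w_s, w_y, xs, d, s0)
    n = len(xs)
    # dE_N/ds_N = dE_N/dy_N * dy_N/ds_N
    dE_ds = (w_y * states[n] - d) * w_y
    total = 0.0
    for i in range(n, 0, -1):
        local = 1 - states[i] ** 2          # derivative of tanh at step i
        total += dE_ds * local * xs[i - 1]  # dE_N/ds_i * ds_i/dw_x
        dE_ds *= local * w_s                # extend the chain by ds_i/ds_{i-1}
    return total


# Illustrative values (assumptions, not from the lesson).
xs = [1.0, 0.5, -1.5, 0.3, 0.9]
w_x, w_s, w_y, d = 0.5, -0.3, 0.8, 0.2

analytic = grad_w_x(w_x, w_s, w_y, xs, d)
eps = 1e-6
numeric = (forward(w_x + eps, w_s, w_y, xs, d)[1]
           - forward(w_x - eps, w_s, w_y, xs, d)[1]) / (2 * eps)
print(analytic, numeric)
```

Keeping a running value for the gradient with respect to the current state avoids recomputing the shared chain-rule factors for every term of the sum.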

Notice the similarities between the calculations of \frac{\partial E_3}{\partial W_s} and \frac{\partial E_3}{\partial W_x}. Hopefully, after understanding the calculation process of \frac{\partial E_3}{\partial W_s}, understanding that of \frac{\partial E_3}{\partial W_x} was straightforward.
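One way to see the similarity is that both gradients can come out of the same backward sweep. In the sketch below, the two differ only in the last chain-rule factor: the input x_i for W_x and the previous state s_{i-1} for W_s. The scalar model (tanh activation, linear output, squared-error loss, zero initial state), the helper names `loss` and `grads`, and the values are assumptions for illustration.

```python
import math


def loss(w_x, w_s, w_y, xs, d):
    """Final-step loss E_N of the hypothetical scalar RNN (initial state 0)."""
    s = 0.0
    for x in xs:
        s = math.tanh(w_x * x + w_s * s)
    return 0.5 * (d - w_y * s) ** 2


def grads(w_x, w_s, w_y, xs, d):
    """Return (dE_N/dw_x, dE_N/dw_s) from a single backward sweep."""
    states = [0.0]                      # states[i] holds s_i
    for x in xs:
        states.append(math.tanh(w_x * x + w_s * states[-1]))
    dE_ds = (w_y * states[-1] - d) * w_y
    g_x = g_s = 0.0
    for i in range(len(xs), 0, -1):
        delta = dE_ds * (1 - states[i] ** 2)  # gradient at the pre-activation of s_i
        g_x += delta * xs[i - 1]      # ds_i/dw_x uses the input x_i
        g_s += delta * states[i - 1]  # ds_i/dw_s uses the previous state s_{i-1}
        dE_ds = delta * w_s           # pass the gradient back to s_{i-1}
    return g_x, g_s


# Illustrative values (assumptions, not from the lesson).
xs = [1.0, 0.5, -1.5]
w_x, w_s, w_y, d = 0.5, -0.3, 0.8, 0.2

g_x, g_s = grads(w_x, w_s, w_y, xs, d)
eps = 1e-6
num_x = (loss(w_x + eps, w_s, w_y, xs, d) - loss(w_x - eps, w_s, w_y, xs, d)) / (2 * eps)
num_s = (loss(w_x, w_s + eps, w_y, xs, d) - loss(w_x, w_s - eps, w_y, xs, d)) / (2 * eps)
print(g_x, num_x)
print(g_s, num_s)
```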